



MICROCOPY RESOLUTION TEST CHART
NATIONAL BUREAU OF STANDARDS - 1963 - A

# AD-A160 206

Conflict Sensitivity of Algorithms
Part I: A CRAY X-MP Study

D. A. Calahan

March 15, 1985

Sponsored by
Air Force Office of Scientific Research
Los Alamos National Laboratory



ITIC FILE COPY

Supercomputer Algorithm Research Laboratory

Department of Electrical Engineering & Computer Science



### Conflict Sensitivity of Algorithms Part I: A CRAY X-MP Study

D. A. Calahan

March 15, 1985

Sponsored by
Air Force Office of Scientific Research
under Grant 84-0096

and

Los Alamos National Laboratory

ATRONOMIC

TO STATE OF THE STAT

|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | REPORT DOCUME                     | NTATION PAGE                                       | <u> </u>           |                   |           |  |  |  |  |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------|----------------------------------------------------|--------------------|-------------------|-----------|--|--|--|--|
| 18. REPORT SECURITY CLASSIFICATION                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |                                   | 1b. RESTRICTIVE MARKINGS                           |                    |                   |           |  |  |  |  |
| Unclassified                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | <del></del>                       |                                                    |                    |                   |           |  |  |  |  |
| 28. SECURITY CLASSIFICATION AUTHORITY                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                   | 3. DISTRIBUTION/AVAILABILITY OF REPORT             |                    |                   |           |  |  |  |  |
| 2b. DECLASSIFICATION/DOWNGRADING SCHED                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | IUI E                             | Approved for Public release; distibution Unlimited |                    |                   |           |  |  |  |  |
| _ N/A                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                   | OII I IIII CEG                                     |                    |                   | 1         |  |  |  |  |
| 4. PERFORMING ORGANIZATION REPORT NUM                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | BER(S)                            | 5. MONITORING ORGANIZATION REPORT NUMBER(S)        |                    |                   |           |  |  |  |  |
| SARL #7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |                                   | AFOSR - TR . E 7 6 4                               |                    |                   |           |  |  |  |  |
| 6a. NAME OF PERFORMING ORGANIZATION                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | 6b. OFFICE SYMBOL (If applicable) | 7a. NAME OF MONIT                                  | ORING ORGAN        | ZATION            |           |  |  |  |  |
| University of Michigan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | (17 applicable)                   | AFOSR                                              |                    |                   |           |  |  |  |  |
| 6c. ADDRESS (City, State and ZIP Code)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |                                   | 7b. ADDRESS (City,                                 | State and ZIP Cod  | le)               |           |  |  |  |  |
| Dept. of Elec. Eng. & Compute                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | er Science                        | Bldg. 410                                          |                    |                   |           |  |  |  |  |
| Ann Arbor, MI 48109                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |                                   | Bolling AFE                                        | 3, D.C. 20         | 1332              |           |  |  |  |  |
| Se. NAME OF FUNDING/SPONSORING ORGANIZATION                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | 8b. OFFICE SYMBOL (If applicable) | 9. PROCUREMENT I                                   | NSTRUMENT ID       | ENTIFICATION N    | UMBER     |  |  |  |  |
| AFOSR                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | NM                                | AF0SR-84-00                                        | 096                |                   |           |  |  |  |  |
| Bc. ADDRESS (City, State and ZIP Code)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |                                   | 10. SOURCE OF FUN                                  | IDING NOS.         |                   |           |  |  |  |  |
| Bldg. 410<br>Bolling AFB, D.C. 20332-6448                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | 3                                 | PROGRAM<br>ELEMENT NO.                             | PROJECT<br>NO.     | TASK<br>NO.       | WORK UNIT |  |  |  |  |
| •                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | 61102F                            | 2304                                               | A3                 |                   |           |  |  |  |  |
| 11. TITLE (Include Security Classification) Confli                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |                                   |                                                    |                    |                   |           |  |  |  |  |
| of Algorithms Part I: A CRAY                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | X-MP Study                        | L                                                  |                    | <u> </u>          |           |  |  |  |  |
| 12. PERSONAL AUTHOR(S)  D. A. Calahan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                   |                                                    |                    |                   |           |  |  |  |  |
| 13a. TYPE OF REPORT 13b. TIME CO                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | OVERED                            | 14. DATE OF REPOR                                  | T (Yr., Mo., Day)  | 15. PAGE C        | OUNT      |  |  |  |  |
| Interim FROM                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | то                                | 15 March 19                                        | 985                | 23                |           |  |  |  |  |
| 16. SUPPLEMENTARY NOTATION                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                   |                                                    |                    |                   |           |  |  |  |  |
| 17. COSATI CODES                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | 18. SUBJECT TERMS (C              | ontinue on reverse if ne                           | cessary and identi | fy by block numbe | r)        |  |  |  |  |
| FIELD GROUP SUB. GR.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | Linear algebr                     | a, Computer me                                     | emories. Su        | percomputer       | ·s        |  |  |  |  |
| XXXXXXXXXXXXXXX                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | 2117021 a.gas.                    | a, comparer m                                      |                    |                   |           |  |  |  |  |
| 19. ABSTRACT (Continue on reverse if necessary and                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | l identify by block number        | <del></del>                                        | · ·                |                   |           |  |  |  |  |
| The delay of algorithm execution due to memory conflicts in a 16-processor CRAY X-MP extension is considered. The association between memory access delays of reads and writes, and delays in the resultant algorithm execution is studied by defining an incremental algorithm delay sensitivity and relating it to simulated large-delay and random variations. It is shown that, by devision algorithms with zero incremental sensitivity, library software highly resistant to large delays may be achieved in a many-processor X-MP. |                                   |                                                    |                    |                   |           |  |  |  |  |
| 20. DISTRIBUTION/AVAILABILITY OF ABSTRAC                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |                                   |                                                    |                    |                   |           |  |  |  |  |
| UNCLASSIFIED/UNLIMITED X SAME AS RPT.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                   | 21. ABSTRACT SECU<br>Unclassific                   |                    | CATION            |           |  |  |  |  |
| 22s. NAME OF RESPONSIBLE INDIVIDUAL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |                                   | 225 TELEPHONE NO                                   |                    | 22c OFFICE SYN    | BOL       |  |  |  |  |
| John P. Thomas Jr., Capt. USA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | AF                                | (202) 767-502                                      | de ,<br>6          | NM                |           |  |  |  |  |

### Abstract

The delay of algorithm execution due to memory conflicts in a 16-processor CRAY X-MP extension is considered. The association between memory access delays of reads and writes, and delays in the resultant algorithm execution is studied by defining an incremental algorithm delay sensitivity and relating it to simulated large-delay and random variations. It is shown that, by devising algorithms with zero incremental sensitivity, library software highly resistant to large access delays may be achieved in a many-processor X-MP.

### Acknowledgement

The assistance of Ken Elliott III in development of a (proprietary) CRAY X-MP simulator is acknowledged.

### TABLE OF CONTENTS

| A.  |       |             |       | rioi |     |     |           |              |     |   |   |   |   |   |   |   |   |   |   |   |   |   |     |
|-----|-------|-------------|-------|------|-----|-----|-----------|--------------|-----|---|---|---|---|---|---|---|---|---|---|---|---|---|-----|
| В.  |       |             |       | CCE  |     |     |           |              |     |   |   |   |   |   |   |   |   |   |   |   |   |   | 2   |
|     |       |             |       | ical |     |     |           |              |     |   |   |   |   |   |   |   |   |   |   |   |   |   | 3   |
|     | 2.    | A           | Se    | nsi  | tiv | ity | , Me      | eas          | ure | • | • | • | • | • | • | • | • | • | • | • | • | • | 3   |
| SI  | IULA: | red         | SEI   | NS I | CIV | ITY | . S       | מטי          | IES |   | • | • | • | • | • | • | • | • | • |   | • | • | 5   |
| Α.  | LA    | RG E        | -DEI  | .ΔΥ  | SE  | NST | ጥፐና       | <i>)</i> T T | v . |   |   |   |   |   |   |   |   |   |   |   |   |   | 9   |
|     | EF    |             |       |      |     |     |           |              |     |   |   |   |   |   |   |   |   |   |   |   |   |   |     |
|     | AN    |             |       |      |     |     |           |              |     |   |   |   |   |   |   |   |   |   |   |   |   |   |     |
| D.  |       |             |       | ONS  |     |     |           |              |     |   |   |   |   |   |   |   |   |   |   |   |   |   |     |
| ALC | GORI' | <u>MH1</u>  | DE    | LAY  | RE  | SUL | <u>TS</u> | •            |     | • | • | • | • | • | • | • | • | • | • | • | • | • | 1   |
| CO  | NFLI  | <u>CT-1</u> | RES   | ST   | NT  | AI  | .GOI      | RIT          | HM. | • | • | • | • | • | • | • | • | • |   | • | • | • | 1 4 |
| Α.  | INT   | ROI         | יטטכי | 101  | ١.  |     |           | •            |     | • | • |   | • |   | • |   | • |   | • | • | • | • | 1 4 |
| В.  | LO    | CAL         | ME    | MOR: | Y U | TIL | IZ        | AT I         | ON. |   | • | • | • |   | • | • | • | • | • |   |   |   | 14  |
| c.  | NO    | IU-N        | TI    | STE  | RID | E A | CCE       | ESS          |     | • | • | • | • | • | • | • | • | • | • | • | • |   | 1 ! |

### I. ALGORITHM CONFLICT SENSITIVITY.

### A. INTRODUCTION

In a companion report [1], the effect of different memory conflict resolution protocols on delays of memory accesses was studied. These access delays produce a delay in algorithm execution or algorithm delay. However, the relationship between these two delays has not been investigated in the literature, in part because the hardware cannot measure access delay in general, and partly because a memory access simulator is far easier to develop than the full instruction-level timing simulator necessary to measure algorithm delay.

A priori, it may not seem worth correlating access and algorithm delays. An architect may be comfortable with the assumption that there is a general correspondence between the two. Indeed, it will be one of the purposes of this research to determine a rule-of-thumb relationship by examining some typical scientific application codes; the question of whether the code itself is responsible for enhancing access delays will therefore be answered. Algorithmically, however, it will be shown that codes can be designed to exploit local conflict-free memory and achieve virtual independence of main memory access delays. These will be termed conflict-resistant algorithms. Their study may have short term value in the immediate task of developing library codes for the CRAY family of multiprocessors, and long term value in establishing an additional application of cache and local memory in MP supercomputer architectures.

The experimental vehicle for this largely empirical study is a CRAY X-MP simulator. This instruction-level simulator produces

numerical and timing results; the latter are accurate to within .2% for typical codes executing on a uniprocessor X-MP-2 without interprocessor conflicts. The conflict mechanism of the X-MP-2 is also simulated and has been found to be exact for large kernels of read and write instructions only. The conflict protocol of the X-MP-2 has been adopted in extension to 16 processors (even though the X-MP-4 is known to have some variations), except that processors are paired to achieve adequate memory bandwidth for buffer fetches. More validation is given in an appendix of [1].

- B. MEMORY ACCESS VERSUS ALGORITHM DELAYS
- 1. Critical Path

ľ

The X-MP operates by a system of register and functional unit reservations. Instructions begin execution either (1) when reservations expire on the resources they require, or (2) when elements of a vector operand become available from a functional unit ("chaining"). When reads or writes are delayed by conflicts, the associated register reservations are held and chains are delayed.

A particular instruction issue and/or execution may or may not influence total execution time. For example, address formation is often masked by floating point computation in a vectorized code. When such influence does exist, the instruction is on the critical path of execution.

Determining which instructions are on this critical path is difficult even with a simulator. For example, a hold on instruction issue does not guarantee that the instruction creating the hold is on the critical path; both of the instructions may be

off the critical path and their issue time superfluous to total algorithm timing. The critical path is first a global issue.

This critical path can change as access delays lengthens; for example, a formerly masked read or write may enter the critical path when its execution is delayed and no longer masked. An instruction awaiting two reads, as V3 in the vector sequence (in CRAY assembly language)

| V0 | ,A0,1  | (vector read)     |
|----|--------|-------------------|
| V1 | ,A0,1  | (vector read)     |
| V2 | S1*V1  | (vector multiply) |
| V3 | V2+FV0 | (vector add)      |

could be delayed by conflicts on either V0 or V1; thus the algorithm delay is a function of delays on V0 and V1 reads, creating a "worst-case" risk situation.

In contrast, a "best-case" condition occurs when an instruction is awaiting availability of alternate identical resources. In the X-MP, the read port (of two ports) is chosen during execution. If both ports are busied with delayed reads, the first available one is used.

### 2. A Sensitivity Measure

In spite of the above threats to a well-behaved cause-effect relationship between memory access and algorithm delays, it is nonetheless possible to develop a meaningful sensitivity measure relating the two. Only vector access delays will be considered in the following discussion.

Define

T - algorithm execution time

 $T_{m_i}$  - time memory is busied during access of the ith vector in the critical path

 $\Delta T_{m_i}$  - change in  $T_{m_i}$ 

 $\Delta T_i$  - change in T due to  $\Delta T_{m_i}$ 

AT - total change in T due to delays in vectors.

where  $\mathbf{T}_{m}$  is VL+3 without conflicts, where VL is the vector

length. Then if only the ith instruction is delayed

$$\Delta \mathbf{T}_{i} = \Delta \mathbf{T}_{m_{i}} \tag{1}$$

and the algorithm sensitivity to a delay in the ith access is

$$S_i \stackrel{\Delta}{=} \frac{\text{fractional change in } T}{\text{fractional change in } T_{\mathfrak{m}_i}}$$
 (2)

$$= (\Delta T_{i}/T)/(\Delta T_{m_{i}}/T_{m_{i}})$$
 (3)

$$= T_{m_{i}}/T \tag{4}$$

Thus, this normalized sensitivity is not dependent on the fraction of time a read or write is in the critical path, but, once in the critical path, on its total vector length.

If m vectors are in the critical path, and if the effects on T of each vector access delay are independent of other delays (to be tested by simulation), then

$$\Delta \mathbf{T} = \sum_{i=1}^{m} \Delta \mathbf{T}_{m_i} \tag{5}$$

If all vectors are delayed by a <u>uniform</u> fraction of their lengths so that  $\Delta T_{m_i}/T_{m_i} = D$ , a constant, then define the global

sensitivity

$$S_u \stackrel{\Delta}{=} \frac{\text{fractional change in } T}{D}$$
 (6)

$$= (\Delta T/T)/D \tag{7}$$

$$= \begin{pmatrix} m \\ \Sigma & T_{m_i} \end{pmatrix} / T$$
 (8)

In the limiting case, then, if every vector access has one clock in the critical path in the conflict-free case, the total sensitivity would be proportional to the sum of all the read/write vector lengths!

### II. SIMULATED SENSITIVITY STUDIES

### A. LARGE-DELAY SENSITIVITY

The relationship of Eq. (9) merely represents the additive nature of independent delays. The practical issues are the effects of (a) large and (b) random access delays on the critical path or, equivalently, on the algorithm delay. These two effects will be measured separately by simulation.

The  $S_u$  defined in Eq. (6) can be measured, irrespective of whether Eq. (9) applies as a result of independence. By disabling the X-MP conflict resolution protocol in the simulator and instead artifically delaying all accesses a uniform fraction D of their vector lengths, the delay  $(D_u)$  in algorithm execution can be measured as a function of D. This will test the dependence of the critical path on large but uniform delays.

The result (Figure 1) shows that, for the three test codes, the slope  $S_u$  remains nearly constant for delays of up to 100% of the vector length. Thus, under the assumption of uniform delays, the hazards to the critical path disruption are insignificant for access delays far greater than likely to be encountered in practice. The incremental sensitivities  $S_u$  measured at D=0 are given in Table 1 for a large number of cases.

It should be noted that Eq. (9) has been verified by inspection of clock-level timing of MUL2 and CFD executions.

Fig. 1. Algorithm delay due to uniform access delay VL = 64 for all vectors

Numbers are slopes; dots at slope changes



Thus, algorithm delay seems likely to be independent of access delays for even large uniform delays.

### B. EFFECTS OF RANDOMNESS

With the conflict protocol enabled, the delay in algorithm execution ( $D_{al}$ ) was measured for all processors involved in a simulation, and the delays averaged ( $\overline{D}_{al}$ ). The delays of all accesses were also recorded and averaged ( $\overline{D}_{ac}$ ). These delays were normalized by dividing by the total algorithm execution time and by VL (=64 for all codes), respectively; this yielded  $\widetilde{D}_{al}$  and  $\widetilde{D}_{ac}$ . The sensitivity

$$s_{al} \stackrel{\Delta}{=} \tilde{D}_{al} / \tilde{D}_{ac}$$

is then the measure of the random, large-deviation sensitivity encountered in practice.

Table 1 indicates that  $S_u$  and  $S_{al}$  are sufficiently different that the critical path must be moderately altered in some codes. Since large uniform deviations have been shown to have nominal effects on sensitivity, one is left to conclude that it is the randomness which disrupts the critical path. This is consistent with the previous discussion of how the critical path is altered, e.g., by masking and by best-case and worst-case events.

It appears that the sensitivity to a single delayed access should be less than unity; the provision for late chaining avoids the prospect of a delayed access causing a missed chain-slot time, as in the CRAY-1. However, it is unclear whether  $S_{al}$  can be greater than unity. Nonetheless,  $S_{al} \le 1$  for all measured sensitivities (Table 2), with the largest being .939.

|                  |       |   | Utili          | zations    |   | Delay          |                |     | Se        | nsitivi        | ies                             |
|------------------|-------|---|----------------|------------|---|----------------|----------------|-----|-----------|----------------|---------------------------------|
| Code             | Banks | 1 | Ū <sub>m</sub> | <u>u</u> b |   | õ <sub>m</sub> | ñ <sub>a</sub> | 1   | Sal       | s <sub>u</sub> | s <sub>al</sub> /Ū <sub>m</sub> |
| <b>FFT</b>       |       |   |                |            |   |                |                |     |           | .417<br>.417   |                                 |
|                  |       | 1 | <b></b> .      |            | ſ |                |                | - - | . <b></b> |                |                                 |
| CFD              |       |   |                |            |   |                |                |     |           | .336<br>.336   |                                 |
|                  |       | - |                |            | - |                |                |     | ·         |                |                                 |
| MUL <sub>1</sub> |       |   |                |            |   |                |                |     |           | .464<br>.464   |                                 |
|                  |       | - |                |            | - |                |                |     | · <b></b> |                |                                 |
| MUL <sub>2</sub> |       |   |                |            |   |                |                |     |           | .602<br>.602   |                                 |
|                  |       | - |                |            | - |                |                |     |           |                |                                 |
| MUL <sub>3</sub> |       |   |                |            |   |                |                |     |           | .147           |                                 |

Table 1. Summary of simulation results for 16 processors.

Sixteen samples were used to determine averages.

### C. AN EMPIRICAL RELATIONSHIP

Aside from confirming intuition, Table 1 appears to show a relationship between sensitivity of Fortran codes and their memory utilization.

Define

Um = average number of memory reads ad writes per processor algorithm execution time per processor

Then the ratio  $S_{al}/\overline{U}_m$  is shown in Table 1 to range over a rather restricted set of values (.552 to .766), across different codes and in the presence of access delays  $\widetilde{D}_{ac}$  which vary over a 5:1 range (.051 to .254) as the number of banks is varied. An approximate sensitivity determined from

 $S_{al} = .65 \overline{U}_{m}$ 

would be within 18% for all cases. The range of this approximation is limited however, if  $S_{al}$  is bounded by unity. The relationship between  $S_{al}$  and  $\overline{U}_{m}$  is felt to be indirect; possibly it is due to the number of ports rather than  $\overline{U}_{m}$  which supply vector operands to the matrix multiply inner loop. If one of these ports is delayed, a "worst-case" delay is imposed on the loop and  $S_{al}$  increases.

### D. CONCLUSIONS

In this section, two results stand out.

- (a) The randomness rather than the size of the access delays have the greatest effects on the critical path.
- (b) If the memory utilization per processor  $\overline{U}_{\rm m}$  is known, the algorithm sensitivity to access delays may be estimated from the rule-of-thumb

<sup>\*</sup>MUL3 is a specially coded CAL routine (see Section III).

s<sub>al</sub> - .65 Ū<sub>m</sub>

for conventional Fortran vector codes. This puts the simpler memory access simulation performed by computer architects on firm grounds, in so far as their ability to predict algorithm delay.

The above conclusions are based on three Fortran codes; this must be acknowledged as a small sample, in spite of the diversity of their access patterns. Also, the vector length was constant at 64; the above formula could also depend on VL, since for a given  $\overline{U}_m$ , the critical path would likely be more disrupted by short vectors.

### III. ALGORITHM DELAY RESULTS

Although the above study of the two components of  $\tilde{D}_{al}$  (=  $\tilde{D}_{ac}$   $S_{al}$ ) may give insight, the algorithm delay  $\tilde{D}_{al}$  itself is ultimately of interest. These are depicted in Figures 2 and 3 and given in Table 2.

Figure 2 gives  $\widetilde{D}_{al}$  when  $R_{bp}$  = 16, the most likely situation. The FFT, MUL, and CFD codes have nearly the same delay between 2% and 3% from 1 to 16 processors, corresponding to their simular  $\overline{U}_{m}$ . MUL<sub>2</sub>, with high access delay and large  $S_{al}$ , has nearly a 5% delay. Their are seemingly no surprises here.

When the number of banks is halved, Figures 3 indicates that  $S_{al}$  of the high access  $MUL_2$  code increases by the greatest ratio (4.1:1). Even the small differences been curves in Figure 2 are magnified in Figure 3. The implication is that, since  $S_{al}$  remains relatively constant (Table 2) as  $R_{bp}$  increases,  $R_{bp} = 16$  is the smallest ratio which avoids the risk of high  $\widetilde{D}_{ac}$ 's with common  $\overline{U}_{m}$ 's.





### IV. CONFLICT-RESISTANT ALGORITHMS

### A. INTRODUCTION

Undoubtedly, the greatest benefit of defining an algorithm sensitivity is in algorithm design. It will be shown possible, with careful control of the data flow in each processor using assembly language (CAL), to defeat the normal relationship between  $D_{ac}$  and  $D_{al}$  and ultimately to reduce the small-delay (incremental) sensitivity to zero. Since assembly coding is a common practice for CRAY-1 and CRAY X-MP library programs, it may take a small additional effort to isolate these workhorse codes from the large delays possibly associated with many-processor architectures.

### B. LOCAL MEMORY UTILIZATION

Two conditions must be met to guarantee code performance resistant to access delays.

- (a) Shared memory access must be off the critical path, and
- (b) Vector access must be on data in conflict-free storage. The vector register set forms such storage on the X-MP; the former can be achieved by pre-fetching operands and post-storing results.

Prefetching is difficult to achieve for general codes, and, where possible, usually requires loop reordering and other instruction scheduling beyond commercially-available compilers. Library programs, which are often built around a small kernel, are candidates for such coding. Fortunately, CRAY-1 experience has shown that prefetching can be completely masked by floating point computation in linear algebra codes without reducing the execution rate; the vector register set is sufficiently large to act as a non-conflict buffer [2].

A matrix-vector multiply (MUL $_3$ ) has been assembly coded with these features; the resultant sensitivities are shown in Table 1. The related and  $S_{al}$ 's are a small fraction (20-25%) of those for the corresponding MUL $_2$  Fortran code, and considerably less than any other kernel in the table.

### C. NON-UNIT STRIDE ACCESS

 $S_{al}$  of the inner loop of MUL $_3$  has a zero value for small delays. However, the CAL implementation that yielded the low  $S_{al}$  of MUL $_3$  in Tables 1 and 2 for 64 × 16 matrix-vector multiplies produced quite different results when 64 × 64 matrices were multiplied, as indicated in Table 3a. The sensitivities  $S_{al}$  increase with number of banks, although  $\overline{D}_{ac}$  and  $\overline{D}_{al}$  decrease individually. The orgin of the problem seems worthy of discussion.

It is a convenience in CAL coding of matrix multiplies and other linear algebra codes to implement

### Y + Y + MX

by loading the elements of X in reverse order into a vector register, and then arranging them as scalars to multply the columns of M. The related assembly code has the form

A1 64
VL A1
VO ,A0, -1

inner
loop S1 . VO, A1

V3 S1\*FV2

| Banks         | Ď <sub>ac</sub> | $\mathfrak{D}_{\mathtt{al}}$ | $s_{\tt al}$ |
|---------------|-----------------|------------------------------|--------------|
|               | (%)             | (%)                          |              |
| Original code |                 |                              |              |
| 16            | 19.4            | 4.84                         | .249         |
| 32            | 5.32            | 1.64                         | .309         |
| 64            | 1.52            | .842                         | .657         |
| Modified Code |                 |                              |              |
| 16            | 17.6            | 1.47                         | .083         |
| 32            | 1.74            | .391                         | .224         |
| 64            | 1.53            | .271                         | .178         |

Table 3. Effect of eliminating counter-grain access in 64 × 64 multiply; p = 4.



Figure 4. Effect of negative-stride access

|                                                                     | 1                                            | R <sub>pb</sub> = 8                          |                                             | R <sub>pi</sub>                              | o = 16                                      |                                      |
|---------------------------------------------------------------------|----------------------------------------------|----------------------------------------------|---------------------------------------------|----------------------------------------------|---------------------------------------------|--------------------------------------|
| 1 processor                                                         | Ď <sub>ac</sub>                              | $\mathfrak{D}_{\mathtt{al}}$                 | Sal                                         | Ď <sub>ac</sub>                              | $\mathfrak{D}_{\mathtt{al}}$                | Sal                                  |
|                                                                     | (%)                                          | (%)                                          |                                             | (%)                                          | (%)                                         |                                      |
| MUL 1<br>MUL 2<br>MUL 3<br>CFD<br>FFT<br>Average*                   |                                              |                                              |                                             | 4.22<br>3.75<br>3.43<br>6.56<br>4.37<br>4.72 | 2.53<br>2.19<br>.76<br>1.85<br>1.72<br>2.07 | .600<br>.584<br>.221<br>.282<br>.393 |
| 2 processor                                                         |                                              |                                              |                                             |                                              |                                             | -                                    |
| MUL <sub>1</sub> MUL <sub>2</sub> MUL <sub>3</sub> CFD FFT Average* | 26.5<br>29.6<br>6.87<br>15.9<br>17.8<br>22.4 | 9.25<br>20.0<br>1.81<br>4.68<br>6.66<br>10.1 | .349<br>.675<br>.263<br>.294<br>.374        | 3.12<br>4.53<br>6.25<br>7.19<br>4.84<br>4.92 | 2.36<br>3.04<br>.56<br>2.21<br>2.15<br>2.44 | .756<br>.671<br>.089<br>.307<br>.444 |
| 4 processor                                                         |                                              |                                              |                                             |                                              |                                             |                                      |
| MUL <sub>1</sub> MUL <sub>2</sub> MUL <sub>3</sub> CFD FFT Average  | 21.2<br>27.9<br>5.15<br>14.3<br>15.0         | 6.61<br>19.8<br>1.12<br>5.15<br>6.73<br>9.57 | .311<br>.709<br>.217<br>.360<br>.448        | 6.09<br>5.62<br>1.87<br>5.93<br>5.46<br>5.77 | 2.61<br>5.28<br>.31<br>1.88<br>2.71<br>3.12 | .429<br>.939<br>.166<br>.317<br>.496 |
| 8 processor                                                         |                                              |                                              |                                             |                                              |                                             |                                      |
| MUL <sub>1</sub> MUL <sub>2</sub> MUL <sub>3</sub> CFD FFT Average* | 16.2<br>25.6<br>7.34<br>12.1<br>14.0         | 6.02<br>19.7<br>1.36<br>4.24<br>6.61<br>9.14 | .371<br>.769<br>.185<br>.350<br><u>.472</u> | 4.69<br>6.87<br>1.87<br>5.00<br>6.56<br>5.78 | 2.27<br>6.14<br>.49<br>1.85<br>3.48<br>3.43 | .484<br>.893<br>.262<br>.370<br>.530 |
| 16 processor                                                        |                                              |                                              |                                             |                                              |                                             |                                      |
| MUL <sub>1</sub> MUL <sub>2</sub> MUL <sub>3</sub> CFD FFT Average* | 16.9<br>25.3<br>6.56<br>12.3<br>16.9         | 6.51<br>21.9<br>1.04<br>4.30<br>8.16<br>10.2 | .385<br>.865<br>.158<br>.349<br>.482        | 5.00<br>5.93<br>2.65<br>6.09<br>7.65<br>6.17 | 2.51<br>5.09<br>.40<br>2.29<br>3.83<br>3.43 | .502<br>.858<br>.151<br>.376<br>.500 |

<sup>\*</sup>MUL3 not included in averages

Table 2. Delay and sensitivity summary for two  $R_{\mbox{\scriptsize bp}}$  ratios. X-MP-2 protocol.

A delay in the load of VO may delay the first trip through the inner loop if VL is sufficiently long, since the load of S1 will not chain off the read. Worse, the read has a negative unit stride, whereas all other accesses have positive unit strides. Figure 4 shows the effects of such an access on the other highly-regular accesses. Access Z3, beginning at clock 5588, intesects and delays seven other accesses, two of them twice. It is evident that a window of accesses extending approximately -VL and +2VL clocks from the initiation of VO is potentially affected by such a counter-grain access. The access Z3 itself is delayed by 28 clocks.

When the access is replaced by a positive unit-stride access, the low sensitivity of the modified code of Table 3 is obtained. The  $\tilde{D}_{al}$  is reduced to insignificant levels (.272%) for a typical  $R_{bp}$  = 16, and retains these levels (1.47%) when the number of banks is reduced by a further factor of 4!

### REFERENCES

- [1] Calahan, D. A., and K. E. Elliott III, "Memory Conflict Simulation of Many-Processor CRAY Architecture. Part I: A CRAY X-MP Study," Report SARL #6, Supercomputer Algorithm Research Laboratory, Department of Electrical Engineering and Computer Sciences, University of Michigan, March 1, 1985.
- [2] Calahan, D. A. et al, "Sparse Matrix and Other High Performance Algorithms for the CRAY-1," Report #124, Systems Engineering Laboratory, Department of Electrical Engineering, University of Michigan, January, 1979.

### APPENDIX A

### EXPERIMENT DESCRIPTION

### EXPERIMENTAL PARAMETERS

The codes were produced by the X-MP CFT compiler from Fortran source codes. Vector length (VL) is 64 and stride is 1 for all cases.

Distinct program and data storage was used for each of the 16 processors. Code executions were initiated at irregular intervals to further randomize accesses between processors. In general, p samples were used to produce mean values with p procesors.

Two global static measures of memory accesses were made to monitor their uniformity.

(a) Memory utilization. This is the fraction

$$\overline{U}_{m} = \frac{\text{Total operands and results}}{\text{Simulation time (CP's)}}$$

for the average processor; it is a measure of memory traffic for each code, and has a maximum value of 3, corresponding to the number of memory ports per processor. Table 1 shows  $\overline{U}_{m} \approx .67$  for FFT, CFD, and MUL<sub>1</sub>.

(b) Bank utilization. Let  $N_b$  be the number of banks. There is a risk with 64-length unit-stride vectors and  $N_b$  > 64 that banks will not be equally utilized; this would create uncharacteristic delays in heavily-utilized banks. If  $\widetilde{N}$  is the average number of accesses per processor across all banks, and N is the standard deviation from this average, define the bank utilization

$$\overline{U}_b = \frac{\overline{N} - N}{\overline{N}}.$$

 $\overline{U}_b$  = 1 indicates uniform accessing; if only 1/2 of the banks are accessed,  $\overline{U}_b$  = 1/2. Table 1 indicates .832  $<\overline{U}_b$  < .998..

### CODE DESCRIPTIONS

F

Ë

- (a) Fluids kernel (CFD). Taken from the vectorized code of [7], this is a 32-statement single-loop Fortran kernel with an average of 3.2 64-length vector-vector operations/statement. Lack of a repetitive computational structure like FFT and MUL should make the access pattern the most random. Six buffer fetches occur in one kernel execution.
- (b) FFT kernel (FFT). This code determines multiple 8-point complex-complex FFT's. Five buffer fetches occur in one kernel execution.
- (c) Matrix-vector multiply kernel (MUL<sub>1</sub>, MUL<sub>2</sub>, MUL<sub>3</sub>). The inner-loop of MUL<sub>1</sub> and MUL<sub>2</sub> has two vector reads and one write per execution. MUL<sub>1</sub> maintains low memory utilization ( $U_m = .69$ ) with VL = 64 by multiplying 4 small (64 x 3) matrices in one kernel execution step; MUL<sub>2</sub> uses the same code with 512×2 matrices, which successively exercises the inner-loop 512/64 = 8 times, and achieves  $\overline{U}_m = 1.58$ , a value more characteristic of a large Fortran-coded matrix multiply on the X-MP. No buffer fetches occur in consecutive executions of the kernel. The inner loop of MUL<sub>3</sub> has one pre-fetched vector read per inner loop execution.

## END

### FILMED

11-85

DTIC